Introduction

Overview and Motivation

In a time when depression, suicide, epidemics and wars have become everyday words, people seem to grow more and more tense about the future. Worry and anxiety weigh on people, and more and more of them are becoming mentally and physically ill. As we are neither medical students nor experts in this field, we instead looked for ways to increase people’s Happiness. That sounds like a logical goal, but we tend to be unaware of the factors that actually make people happier.

We wanted to understand which factors have a positive impact on Happiness. We therefore took an interest in a particular study that explains Happiness through several factors. That work provides a Happiness score per country, and thus a rank per country as well. At first we wanted to use that study directly, but as it already presented the desired results, we decided to start from zero: we sought out raw datasets for the factors and kept only the Happiness score of the different countries. We use some of the variables from the study and also propose new ones of our own choosing. This topic was interesting for us to work on and let us develop our skills in R.

Overall, we work on a global scale, which lets us compare and contrast the results obtained for the different regions.

While reports on the subject have been published regularly over the last decade, we set this assignment in the year 2017 for two reasons. First, the data we gathered from the different sources are more complete for 2017 than for any other year. Second, the COVID-19 crisis emerged worldwide shortly afterwards, and the data from 2020 onwards could be substantially distorted by the restrictions that societies faced. Although there are always international problems that affect every part of the world, this one was of a scale that will remain in our history. Because we want to avoid any significant bias, we prefer to focus our project on 2017, a year with no comparable major event.

The main objective of this analysis is to evaluate how much each variable contributes to the final measurement of Happiness. Moreover, we want to highlight and explain the factors with the greatest impact on people. Finally, we want to discover which factors are most likely to increase the final score, and to look for a trend in each continent by separating the countries into groups.

Research Questions

We have thought of different questions that we wanted our work to answer :

  1. To which factors is Happiness correlated ?

  2. Which socio-economic factors do country authorities need to focus on in order to increase people’s Happiness ?

  3. Why do the main factors not have the same impact in different continents ?



Data

For this part of the project, we decided to present the different datasets in table form. To do that, we used the function kable from the package kableExtra. This gives us a table with each variable and its definition.

2.1 World Happiness Report Dataset

This dataset is the main one, and it gave us the idea for the type of project we wanted to carry out. However, as it already presented almost all of the results, we decided to take inspiration from it, search for the raw datasets and then rebuild the analysis of the Happiness Score ourselves. There are also some variables we decided to omit, such as Whisker.high, Whisker.low, Freedom, Generosity and Government Corruption.

The variables that we kept are Country, Region, Happiness Rank and Happiness Score. We kept the score from this dataset because it comes from a survey conducted in 2017. As such survey results are difficult to reproduce, we kept the original country scores. To select these variables, we simply subsetted the columns of interest, using brackets [ ] with the numbers of the columns to keep.


Variable          Definition
Country           Name of the country
Happiness Rank    Rank of the country based on the Happiness Score
Happiness Score   A metric measured in 2017 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?”

Source : https://www.kaggle.com/datasets/unsdsn/world-happiness

2.2 Socio-Economic Datasets coming from the World Bank Open Data

For the following datasets from the World Bank Open Data, the same data wrangling steps were applied. First, because of a recurring error message and because some variables were of no use for our project, we removed some columns from the original datasets. Then we kept only the country names and the 2017 values of each factor, once again using brackets to select the columns we wanted. Finally, we renamed the columns with the function colnames to make them easier to understand.
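The wrangling steps above can be sketched in base R as follows. The miniature data frame `wb_raw` is a hypothetical stand-in for a World Bank export: only its column layout mirrors the real files, the 2016 cells are deliberately left empty, and the 2017 values are the rounded figures from the summary table of this report. The names `wb_raw` and `life_exp` are our own choices for illustration.

```r
# Hypothetical miniature of a World Bank export (layout only; not the real file)
wb_raw <- data.frame(
  `Country Name`   = c("Norway", "Togo"),
  `Country Code`   = c("NOR", "TGO"),
  `Indicator Name` = "Life expectancy at birth, total (years)",
  `Indicator Code` = "SP.DYN.LE00.IN",
  `2016`           = c(NA_character_, NA_character_),  # left empty on purpose
  `2017`           = c("82.61", "60.49"),              # read in as character
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# Keep only the country name and the 2017 column, using brackets
life_exp <- wb_raw[, c("Country Name", "2017")]

# Rename the columns for readability
colnames(life_exp) <- c("Country", "Life_exp")

# Convert the character values to numeric so they can be used in computations
life_exp$Life_exp <- as.numeric(life_exp$Life_exp)
```

Selecting by column name rather than by column number is a small design choice in this sketch: it keeps working if the export gains or loses a year column.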

2.2.1 Life Expectancy Dataset

This dataset presents life expectancy at birth, in years, from 1960 to 2021 for the countries of the world. We wanted to focus on the values themselves, so we did not need the Country Code, Indicator Name and Indicator Code. What remains is the name of the country and the corresponding value for 2017.


Variable   Definition
Country    Name of the country
2017       Life Expectancy at birth (years) in 2017

Source : https://data.worldbank.org/indicator/SP.DYN.LE00.IN?most_recent_year_desc=false

2.2.2 Education Dataset

This dataset focuses on government spending on education, as a percentage of GDP, throughout the world from 1960 to 2021.


Variable   Definition
Country    Name of the country
2017       Government spending on Education (% of GDP) in 2017

Source : https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?end=2021&most_recent_year_desc=false&start=2015

2.2.3 Economy Dataset

This dataset provides the GDP per capita of each country for the period from 1960 to 2021.


Variable   Definition
Country    Name of the country
2017       GDP per capita in 2017

Source : https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?end=2021&most_recent_year_desc=false&start=2015

2.2.4 Wages Gap Dataset

For this factor, we first had two separate datasets : the percentage of wages for men and the percentage of wages for women. However, what interested us was the link between Happiness and the gap between male and female wages. We therefore collected both datasets and computed the difference (male minus female) in a separate Excel sheet. Our final table presents the resulting gap between male and female wages, in percentage points.


Variable   Definition
Country    Name of the country
2017       Gap between male and female wages in 2017 (male minus female, in percentage points)

Sources : https://data.worldbank.org/indicator/SL.EMP.WORK.MA.ZS?end=2018&most_recent_year_desc=false&start=2017, https://data.worldbank.org/indicator/SL.EMP.WORK.FE.ZS?end=2018&most_recent_year_desc=false&start=2017
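The gap was computed in Excel, but the same step can be sketched in R. The two small data frames below use hypothetical countries and values, and the column names `Male`, `Female` and `Wages` are our own; only the rule "male minus female" comes from the text.

```r
# Hypothetical male and female series for 2017 (illustrative values only)
male   <- data.frame(Country = c("A", "B"), Male   = c(70, 55))
female <- data.frame(Country = c("A", "B"), Female = c(75, 40))

# Align the two series on the country name
gap <- merge(male, female, by = "Country")

# Gap = male value minus female value:
# positive when the male value is higher, negative when the female value is higher
gap$Wages <- gap$Male - gap$Female
```

Merging on `Country` before subtracting guards against the two files listing countries in a different order, which a cell-by-cell subtraction in a spreadsheet would not detect.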

2.3 Air pollution Dataset

This dataset gives carbon dioxide emissions in tonnes per capita in 2017. As for the other datasets, we removed the unnecessary variables and kept those of interest to us. The variables removed are Indicator, Subject, Measure, Frequency and Flag Codes. Those that remain are Location (which we renamed Country), Time (renamed Year) and Value. The renaming is again done with the function colnames. Moreover, the countries were identified by country codes, so we also had to convert the codes into the respective country names. For that, we installed a package called countrycode and used its function of the same name.


Variable   Definition
Country    Name of the country
Year       Year 2017
Value      Value corresponding to the tonnes/capita of CO2 emissions

Source : https://data.oecd.org/air/air-and-ghg-emissions.htm
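The code-to-name conversion can be sketched as below. This assumes that the Location codes are ISO 3166-1 alpha-3 codes, which the text does not state explicitly. The small fallback lookup table is our own addition so that the sketch still runs when the countrycode package is not installed.

```r
codes <- c("AUS", "NOR", "USA")

if (requireNamespace("countrycode", quietly = TRUE)) {
  # Convert ISO 3-letter codes to English country names
  country <- countrycode::countrycode(codes,
                                      origin      = "iso3c",
                                      destination = "country.name")
} else {
  # Hand-written fallback covering only the three codes used in this sketch
  lookup  <- c(AUS = "Australia", NOR = "Norway", USA = "United States")
  country <- unname(lookup[codes])
}
```

Codes that do not correspond to a single country (regional aggregates, for instance) would come back as NA from `countrycode` and would need to be dropped or handled separately.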

2.4 World Sustainability Dataset

This dataset provides insightful information on many socio-economic as well as ecological factors that we could use in our project to help explain Happiness. However, as the dataset contains a lot of variables (62) and we would not use all of them, we reduced their number and focused on two variables that we judged interesting. They are listed in the table below. Here too, we subsetted the columns with brackets.


Variable           Definition
Country            Name of the country
Year               Year the value was observed
Internet           Proportion of population covered with at least a 3G mobile network (%)
Renewable Energy   Renewable energy consumption (% of total final energy consumption)

Source : https://www.kaggle.com/datasets/truecue/worldsustainabilitydataset

2.5 Percentage of forests

This dataset provides the percentage of land area covered by forests in each country. It was interesting for us because we were seeking data on nature and ecological factors, and this one seemed an unusual yet intriguing factor to link to Happiness.


Variable   Definition
Country    Name of the country
2017       Percentage of forests in 2017

Source : https://data.worldbank.org/indicator/AG.LND.FRST.ZS



Exploratory Data Analysis

Let us now explore our data ! We want to understand the impact of our different factors on Happiness. To get an idea of the relationships between variables, we merged all our variables with the original Happiness score into a single dataset, called HS_172. We first had to convert the values of our datasets to the “numeric” type with the function as.numeric, applied to every factor, because they were initially stored as “character” and could not be used in computations. We then used the function merge to combine the different factors by their Country variable, and renamed the columns after the merged factors with the function colnames. Next, we were faced with some NA values, which we tried to remove with the function na.omit. However, this was not enough, because some empty observations remained. We therefore wrote a loop with a conditional, using the function ifelse, to delete those blank cells. Finally, we noticed that in our factors’ datasets, “Russia” appeared under the name “Russian Federation”. To avoid problems during the EDA, especially with maps, we changed the name to “Russia”, again using brackets.
To display the merged dataset, we created a table with the function kable, setting the style with kable_styling and the size of the table with scroll_box, both from the package kableExtra. The values in this table are rounded to 2 digits with the round function.

Summary Table for 2017
Country Happiness.Rank Happiness.Score Education Forest_per GDP Life_exp Wages Internet EnergyE Pollution
Albania 109 4.64 3.61 28.79 4531.02 78.33 -3.74 62.40 37.22 1.53
Algeria 53 5.87 6.51 0.82 4109.70 76.50 -7.53 47.69 0.14 3.15
Angola 140 3.80 2.47 54.76 2313.22 60.38 19.04 32.00 56.18 0.56
Argentina 24 6.60 5.45 10.56 14613.04 76.37 -6.68 74.29 10.37 3.95
Armenia 121 4.38 2.71 11.56 3914.53 74.80 -2.29 64.74 12.56 1.75
Australia 10 7.28 5.14 17.42 53934.25 82.50 -8.30 86.55 9.69 15.97
Austria 13 7.01 5.37 47.12 47429.16 81.64 -4.77 87.94 33.96 7.30
Azerbaijan 85 5.23 2.47 13.27 4147.09 72.69 8.84 79.00 1.91 3.12
Bahrain 41 6.09 2.32 0.82 23742.94 77.03 0.46 95.88 0.00 19.92
Belarus 67 5.57 4.79 42.98 5785.67 74.13 -3.10 74.44 7.29 5.71
Belgium 17 6.89 6.43 22.76 44198.48 81.49 -6.73 87.68 9.64 7.94
Benin 143 3.66 3.54 29.13 1136.59 61.17 11.01 13.30 45.38 0.60
Bolivia 58 5.82 8.66 47.53 3351.12 70.94 5.13 43.83 7.46 1.80
Botswana 142 3.77 7.30 27.54 7296.09 68.81 3.36 41.41 28.37 3.33
Brazil 22 6.64 6.32 59.83 9928.68 75.46 -10.25 67.47 45.44 2.09
Bulgaria 105 4.71 4.08 35.50 8366.29 74.81 -5.36 63.41 17.08 6.03
Cambodia 129 4.17 3.20 48.35 1385.26 69.29 11.25 32.90 61.47 0.65
Cameroon 107 4.70 3.06 43.38 1469.45 58.51 16.54 23.20 80.34 0.26
Chile 20 6.65 5.42 24.00 14962.56 79.91 -1.67 82.33 24.11 4.68
China 79 5.27 3.67 22.74 8816.99 76.47 1.74 54.30 12.86 6.68
Colombia 36 6.36 4.54 53.84 6376.71 76.92 -3.67 62.26 32.53 1.45
Costa Rica 12 7.08 7.07 58.48 12225.57 79.91 -5.48 71.58 36.20 1.55
Croatia 77 5.29 3.85 34.52 13629.29 77.83 -4.43 67.10 29.80 3.91
Cyprus 65 5.62 5.72 18.68 26608.88 80.67 -5.82 80.74 11.08 7.53
Denmark 2 7.52 7.75 15.64 57610.10 81.10 -5.82 97.10 35.51 5.51
Dominican Republic 86 5.23 3.92 43.88 7609.35 73.69 -21.99 67.57 16.98 2.05
El Salvador 45 6.00 3.73 28.83 3910.25 72.87 12.87 33.82 24.99 0.94
Estonia 66 5.61 4.96 57.04 20437.77 78.09 -7.19 88.10 27.03 12.51
Ethiopia 119 4.46 5.65 15.32 768.52 65.87 5.13 18.62 90.27 0.12
Finland 5 7.47 6.36 73.73 46412.14 81.63 -7.97 87.47 44.48 7.70
France 31 6.44 5.45 31.05 38781.05 82.58 -5.92 80.50 14.14 4.64
Gabon 118 4.47 3.33 91.46 7230.43 65.84 6.35 50.32 90.12 1.22
Georgia 125 4.29 3.57 40.62 4357.00 73.41 -0.44 59.71 28.03 2.45
Germany 16 6.95 4.87 32.68 44652.59 80.99 -4.96 84.39 15.22 8.70
Ghana 131 4.12 3.53 35.00 2074.29 63.46 15.54 37.88 44.86 0.50
Greece 87 5.23 3.48 30.27 18582.09 81.29 -8.31 69.89 16.38 5.87
Guatemala 29 6.45 2.95 33.25 4454.05 73.81 13.88 40.70 65.07 0.95
Haiti 145 3.60 1.50 12.94 1369.06 63.29 15.54 31.00 76.17 0.30
Honduras 91 5.18 4.94 57.40 2453.73 74.90 9.58 32.14 46.24 0.90
Hungary 75 5.32 4.61 22.54 14623.70 75.82 -3.80 76.75 14.51 4.97
Iceland 3 7.50 7.58 0.49 72010.15 82.66 -7.72 98.26 76.83 4.87
India 122 4.32 4.31 24.00 1980.67 69.17 1.03 18.20 32.21 1.63
Indonesia 81 5.26 2.67 50.04 3837.58 71.28 11.72 32.34 25.43 1.82
Ireland 15 6.98 3.51 11.18 69774.03 82.16 -13.95 84.11 9.94 7.47
Israel 11 7.21 6.06 6.47 40774.13 82.55 -6.35 81.58 3.85 7.34
Italy 48 5.96 4.04 31.80 32406.72 82.95 -10.67 63.08 16.43 5.36
Jamaica 76 5.31 5.26 54.04 5070.10 74.27 -11.44 55.07 10.72 2.35
Japan 51 5.92 3.13 68.41 38834.05 84.10 -2.71 91.73 6.97 8.87
Jordan 74 5.34 3.23 1.10 4231.52 74.29 -13.68 66.79 5.26 2.49
Kazakhstan 60 5.82 2.75 1.25 9247.58 72.95 -1.72 76.43 1.99 11.42
Kenya 112 4.55 4.96 6.29 1633.49 65.91 15.87 17.83 71.32 0.34
Kuwait 39 6.11 6.37 0.35 29759.47 75.31 -1.56 98.00 0.03 21.61
Latvia 54 5.85 4.37 54.63 15695.12 74.63 -4.62 80.11 42.60 3.44
Lebanon 88 5.22 2.13 13.83 7776.03 78.83 -28.37 78.18 3.95 3.95
Lithuania 52 5.90 3.81 35.06 16885.41 75.48 -5.38 77.62 33.78 3.81
Luxembourg 18 6.86 3.49 34.45 110193.21 82.10 -1.98 97.36 15.33 14.47
Malaysia 42 6.08 4.68 58.63 10259.30 75.83 0.76 80.14 5.22 6.78
Malta 27 6.53 4.65 1.31 28857.02 82.35 -10.83 81.01 7.25 3.25
Mexico 25 6.58 4.52 33.99 9287.85 74.95 0.27 63.85 9.54 3.62
Moldova 56 5.84 5.62 11.75 3509.69 71.72 -8.98 76.12 26.06 2.73
Mongolia 100 4.95 4.07 9.10 3687.10 69.51 -6.81 23.71 3.60 6.19
Morocco 84 5.24 5.12 12.80 3035.45 76.22 13.74 61.76 10.42 1.63
Mozambique 113 4.55 5.51 47.57 461.41 59.31 18.48 7.80 68.51 0.22
Namibia 111 4.57 9.71 8.32 5367.11 63.02 10.73 36.84 29.41 1.59
Nepal 99 4.96 4.77 41.59 1048.45 70.17 23.00 21.40 76.41 0.37
Netherlands 6 7.38 5.18 10.89 48675.22 81.76 -6.53 93.20 6.36 9.07
New Zealand 8 7.31 6.26 37.41 42925.00 81.66 -6.77 90.81 30.45 6.67
Nicaragua 43 6.07 4.36 30.81 2159.16 74.07 7.81 27.86 48.87 0.80
Niger 135 4.03 2.58 0.88 517.77 61.60 5.50 10.22 79.44 0.08
Norway 1 7.54 7.91 33.29 75496.75 82.61 -4.17 96.36 61.11 7.20
Pakistan 80 5.27 2.90 4.99 1631.53 66.95 18.41 13.78 42.09 0.88
Panama 30 6.45 2.88 57.31 15146.41 78.15 -5.06 59.95 23.58 2.29
Paraguay 70 5.49 3.09 42.64 5678.87 73.99 3.10 61.08 60.11 1.12
Peru 63 5.72 3.93 56.90 6710.51 76.29 10.14 50.45 27.60 1.58
Poland 46 5.97 4.56 30.85 13864.68 77.75 -7.83 75.99 11.19 7.96
Portugal 89 5.20 5.02 36.15 21490.43 81.42 -8.38 73.79 24.42 4.93
Qatar 35 6.38 2.97 0.00 59124.87 79.98 -0.02 97.39 0.00 30.55
Romania 57 5.82 3.10 30.12 10807.01 75.31 -3.16 63.75 23.38 3.62
Rwanda 151 3.47 3.13 11.07 772.29 68.34 20.50 21.77 86.91 0.09
Saudi Arabia 37 6.34 8.02 0.45 20802.46 74.87 -3.90 94.18 0.02 15.75
Senegal 115 4.53 4.62 42.53 1361.70 67.38 11.46 29.64 39.23 0.51
Serbia 73 5.39 3.71 31.12 6292.54 75.54 -7.45 70.33 20.09 6.63
Singapore 26 6.57 2.77 22.63 61150.73 83.10 -9.13 84.45 0.69 8.43
Slovenia 62 5.76 4.78 61.77 23514.03 81.03 -5.39 78.89 19.67 6.64
Spain 34 6.40 4.21 37.15 28170.17 83.28 -7.90 84.60 15.68 5.49
Sweden 9 7.28 7.57 68.69 53791.51 82.41 -7.81 93.01 52.86 3.65
Switzerland 4 7.49 4.95 31.86 83352.09 83.55 -3.82 89.69 25.00 4.37
Tajikistan 96 5.04 5.84 3.04 848.67 70.65 5.09 21.96 41.69 0.59
Tanzania 153 3.35 4.43 53.23 1004.91 64.48 7.61 16.00 83.83 0.18
Thailand 32 6.42 3.36 39.11 6593.82 76.68 0.08 52.89 22.27 3.53
Togo 150 3.49 3.76 22.40 830.75 60.49 23.74 12.36 77.74 0.16
Ukraine 132 4.10 5.42 16.69 2638.33 71.78 -4.13 58.89 6.47 3.82
United Kingdom 19 6.71 5.38 13.08 40857.76 81.26 -8.23 90.42 9.72 5.45
United States 14 6.99 5.11 33.87 59914.78 78.54 -1.80 87.27 9.92 14.64
Uruguay 28 6.45 4.47 11.24 18690.89 77.63 -6.80 70.32 60.68 1.68
Uzbekistan 47 5.97 5.03 8.20 1916.76 71.39 1.41 48.70 1.77 3.23
Vietnam 94 5.07 4.09 46.00 2974.12 75.24 9.91 58.14 31.98 1.96
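The merging steps described at the start of this section can be sketched in base R as follows. This is a simplified variant, not the exact code behind HS_172: it uses two tiny tables instead of all factors, renames “Russian Federation” before merging rather than after, and replaces the loop with a vectorised ifelse. The Norway, Togo and Haiti scores and the Norway and Togo life expectancies come from the summary table above; both Russia values are placeholders, and the blank cell for Haiti is inserted on purpose.

```r
# Raw imports store everything as character
happiness <- data.frame(
  Country         = c("Norway", "Togo", "Haiti", "Russia"),
  Happiness.Score = c("7.54", "3.49", "3.60", "5.00"),  # Russia: placeholder
  stringsAsFactors = FALSE
)
life_exp <- data.frame(
  Country = c("Norway", "Togo", "Haiti", "Russian Federation"),
  X2017   = c("82.61", "60.49", "", "70.00"),  # "" mimics a blank cell; Russia: placeholder
  stringsAsFactors = FALSE
)

# Harmonise the country name, using brackets
life_exp$Country[life_exp$Country == "Russian Federation"] <- "Russia"

# Turn blank cells into NA, then convert character columns to numeric
life_exp$X2017            <- ifelse(life_exp$X2017 == "", NA, life_exp$X2017)
life_exp$X2017            <- as.numeric(life_exp$X2017)
happiness$Happiness.Score <- as.numeric(happiness$Happiness.Score)

# Merge on Country, rename the columns, drop incomplete rows and round
HS <- merge(happiness, life_exp, by = "Country")
colnames(HS) <- c("Country", "Happiness.Score", "Life_exp")
HS <- na.omit(HS)
HS[, -1] <- round(HS[, -1], 2)
```

Converting blanks to NA before calling `na.omit` is what lets a single call remove every incomplete row; in this sketch Haiti is dropped and the three remaining countries are kept.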


Graph 1 : Maps

The factors we decided to concentrate on in this study are the following : Life expectancy at birth, GDP per capita, Government spending on Education, Forest percentage, CO2 Emissions, Access to Internet, Renewable energy consumption and Gender Wage Gap.

First of all, we wanted to present those factors in a visually interesting way in order to understand the situation worldwide as well as the interaction between the factors. To that end, we created world maps representing each factor. We first downloaded a “.shp” file containing the world countries’ data and created a data frame from it. We then used the function left_join repeatedly to add our factors’ values to this new data frame. Finally, we used the package plotly to create the interactive maps.

Life Expectancy

This map illustrates the life expectancy of the different countries in 2017. The average life expectancy in the world is 72.39 years and varies from 52.24 to 84.22 years. A stark difference is seen in African countries, where we find the lowest values and an average of 62.18 years. The countries with the highest life expectancy are Hong Kong with 84.22 years and Japan with 84.09 years.

Economy
Education

This map shows public spending on education as a percentage of GDP per country, which measures the priority given by governments to educational services and institutions. The highest investment in education was seen in Greenland with 11.3 % and the lowest in Venezuela with 1.19 %.

Gender Wage Gap

The map shows the gender wage gap across countries. The country with the lowest value, in dark blue, is Syria with -37.73%, while the country with the highest value is Mauritania with 28.52%, followed by Nepal with 23%, Papua New Guinea with 19.67%, Congo with 19.32% and Pakistan with 18.41%. Most of the countries with a higher Gender Wage Gap are in Africa and Asia.

Air Pollution

The map shows CO2 emissions across countries, in tonnes per capita. Among the countries with the highest values, in light grey, are Australia with 15.97, Saudi Arabia with 15.75, Canada with 15.29 and the United States with 14.64. The countries with the lowest values, in black, include African countries such as Niger with 0.08, Madagascar with 0.13, Mozambique with 0.22 and Cameroon with 0.26. Countries in Central America also have low values, such as Nicaragua with 0.8, Honduras with 0.9, El Salvador with 0.94 and Guatemala with 0.95, as does Haiti in the Caribbean with 0.3. In Asia, Nepal has 0.37, Bangladesh 0.54, Tajikistan 0.59, Cambodia 0.65 and Pakistan 0.88.

Internet

The map shows the proportion of the population covered by at least a 3G mobile network across countries. The highest rates, in dark red, are found in Iceland with 98.26%, followed by Denmark with 97.10%, Norway 96.36%, Saudi Arabia 94.18%, Netherlands 93.20%, Sweden 93.01%, Canada 92.07% and the United Kingdom 90.42%, indicating that many developed countries have most of their population covered. The lowest rates, in dark blue, are found in African countries such as Eritrea with 1.31%, Chad 6.5%, Mozambique 7.8% and Niger 10.22%. Pakistan (13.78%) and India (18.2%) also have low rates.

Renewable Energy

The map shows renewable energy consumption in the world, as a percentage of total final energy consumption. Africa concentrates the countries with the highest rates of renewable energy consumption in the world : Uganda has the lightest colour with 90.66%, followed by Ethiopia with 90.27% and Gabon with 90.12%. The countries with the lowest rates, in dark blue, are found in the Middle East and North Africa : Oman has 0%, Saudi Arabia 0.02% and Algeria 0.14%.

Percentage of forests

The map shows the percentage of forest area in each country. Greenland has the darkest green colour with almost 0% of forest (0.000535997%), followed by countries such as Oman with 0.01% and Egypt with 0.05%. Papua New Guinea has the lightest colour with 79.4%, followed by Finland with 73.73% and Sweden with 68.69%.

Graph 2 : The 5 best and worst countries


It is important to see how the variables from the first dataset influence the final score of each country, because from this we can infer how relevant they are to people.

The two horizontal bar plots show how the contribution of each variable is distributed. In the first graph, which focuses on the 5 countries with the best Happiness score, the contributions of most variables are consistent across countries. The variables for trust in the Government and Corruption are the ones that vary the most. In contrast, in the second graph, which focuses on the 5 countries with the worst Happiness score, we can easily observe disparity and large differences in the average contribution of each variable. Notably, none of the variables is consistent across these countries, and the one that shows the most change is Dystopia.

The steps to plot these 2 graphs were the following. First, we did a little data wrangling after loading the main dataset (the Happiness Score dataset) : we subsetted the first 3 columns that interested us using brackets and changed their names. Then we corrected the column types, because all the columns were stored as characters; additionally, we added to each country its continent with the help of the package countrycode. Next, we selected the countries with the best score and transposed the whole data frame to make the graph easier to build. Finally, we used the function barplot to draw the stacked bar plot. We repeated the same process for the 5 worst countries.
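The transposition and plotting steps can be sketched as below. The countries, variable names (`Economy`, `Family`, `Health`) and values are hypothetical placeholders rather than figures from the Happiness Score dataset, and the plot is sent to a null device so that the sketch runs without a display.

```r
# Hypothetical contributions for two countries (illustrative values only)
contrib <- data.frame(
  Country = c("A", "B"),
  Economy = c(1.5, 0.4),
  Family  = c(1.4, 0.8),
  Health  = c(0.8, 0.3)
)

# barplot() stacks the rows of a matrix and draws one bar per column,
# so we transpose: variables become rows, countries become columns
m <- t(as.matrix(contrib[, -1]))
colnames(m) <- contrib$Country

pdf(NULL)  # null device: nothing is written to disk
barplot(m, horiz = TRUE, legend.text = rownames(m),
        main = "Contribution of each variable to the Happiness score")
invisible(dev.off())
```

The transposition is needed because `barplot` treats each column of a matrix as one bar, so leaving countries in rows would produce one bar per variable instead of one bar per country.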


Graph 3 : Continents Representation


We also wanted to group the countries into continents and present results based on those groups. In order to do that, we again used the package countrycode, which gives us a new column with the corresponding continent for each country. Our goal was to show the average Happiness score for each continent. Therefore we used the function filter to have the continents separated and we then computed the average Happiness score per continent. We finally created a new dataset with all those values.
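As a rough sketch of this computation, the per-continent average can also be obtained in one call with base R's `aggregate`, used here as a stand-in for the filter-based approach described above. The scores are taken from the summary table, while the continent labels are typed by hand for this example instead of coming from countrycode.

```r
hs <- data.frame(
  Country   = c("Australia", "New Zealand", "Norway", "Denmark", "Togo", "Benin"),
  Continent = c("Oceania", "Oceania", "Europe", "Europe", "Africa", "Africa"),
  Score     = c(7.28, 7.31, 7.54, 7.52, 3.49, 3.66)
)

# Average Happiness score per continent
avg <- aggregate(Score ~ Continent, data = hs, FUN = mean)

# Oceania: (7.28 + 7.31) / 2 = 7.295, which rounds to 7.3
avg$Score[avg$Continent == "Oceania"]
```

Only the Oceania figure is meaningful here, since both of its countries are included; the Europe and Africa means are computed over two countries each and are purely illustrative.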

Barplot

From the bar graph below, created with the package highcharter, we can visualize the average Happiness score by continent. Oceania is the happiest continent, with the highest value of 7.3. However, because Oceania here consists of only two countries, Australia and New Zealand, whose Happiness scores are both above average, the result for this continent has to be interpreted cautiously. Setting Oceania aside, Asia is the continent with the highest scores in terms of Happiness, followed by Europe and Americas. We notice that Africa has the lowest average Happiness score.

Boxplot

The bar plot gave a simple view of the Happiness Score for each continent. To better observe the distribution of the observations within their respective continents, along with some important statistics for each region, we used a boxplot. We used the ggplot2 package and added points to the boxplot to visualize every observation. Likewise, ggtext was used to change the size of the title.

In the resulting graph, we see that there are few atypical observations in the Happiness score. Here, we decided to remove Oceania, because a boxplot of only 2 observations would not be useful. Among the other boxplots, Americas is the one with the most concentrated data, because it has the smallest range (maximum value minus minimum value), indicating that this continent maintains a trend without many fluctuations. Europe is the region with the most dispersion in its data (judging by its IQR), which tells us that the countries there present notable differences in their results. Finally, as we have seen in other graphs, Africa is located below the other continents due to its low Happiness scores.
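The two measures of spread used in this reading of the boxplot, the range and the IQR, can be computed directly. The sketch below uses eight scores taken from the summary table (four per continent), so its numbers illustrate the calculation only and do not reproduce the full boxplot; the helper `spread` is a name of our own.

```r
scores <- data.frame(
  Continent = rep(c("Europe", "Africa"), each = 4),
  Score     = c(7.54, 6.95, 5.20, 4.10,   # Norway, Germany, Portugal, Ukraine
                5.87, 4.55, 3.66, 3.35)   # Algeria, Kenya, Benin, Tanzania
)

# Range = maximum minus minimum; IQR = third quartile minus first quartile
spread <- function(x) c(range = max(x) - min(x), IQR = IQR(x))

# One column per continent, one row per measure
res <- sapply(split(scores$Score, scores$Continent), spread)
res
```

The range depends only on the two most extreme countries, whereas the IQR describes the middle half of the data, which is why the two measures can rank continents differently.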


Graph 4 : Relationships between variables


Let us now continue our exploratory analysis with a basic scatterplot matrix of all our variables.

When looking at this scatterplot, we want to look for any relationship with our response variable Happiness.Score. We notice that Life_exp, Wages, Internet and Pollution seem to be related in some way to the Happiness score.

Let us then have a look at each variable that seems to be linked with the Happiness score. We wanted to present each link in an interactive way and therefore used the plotly package once again.

Scatterplot 1

We start with Life Expectancy at birth.

We can see from the previous chart that the Happiness score seems to increase when the life expectancy at birth tends to increase as well.

Scatterplot 2

Then we have the link with the wage gap between male and female.

In this case, we find an interesting result too. With a high wage gap between genders (M-F), we tend to see a low Happiness score. The converse also holds : with a low wage gap, the Happiness score seems to increase. The wage gap is computed as the difference between male wages and female wages. Interestingly, countries where women’s salaries are slightly higher than men’s seem to be happier than the others.

Scatterplot 3

The next plot shows the relationship with the access to 3G Internet.

We observe a positive link. As more people have access to Internet, the Happiness score increases.

Scatterplot 4

Finally, we will study the link with the emissions of CO2, which corresponds to air pollution.

In this situation, the results are surprising : the plot suggests that with more CO2 emissions, and thus more air pollution, people seem to be happier, which is counter-intuitive. We will have to be careful with this variable when computing the regression.

A possible explanation is that in most developed countries, the resources and technology in use release more CO2 than in countries in Africa, for example. CO2 emissions therefore reflect a country’s level of development, which would explain why a country like Qatar, with a high pollution rate, has a Happiness score that is higher than that of some European countries that are considered developed.


Another point we will have to be cautious about is the correlation between explanatory variables, which could lead to multicollinearity problems and thus to unstable coefficient estimates. We notice correlation between the following pairs of variables : Life_exp and Internet, Life_exp and Pollution, as well as Internet and Pollution.



Graph 5 : Happiness Score per country - World Map


The above world map, created with the package leaflet, represents the average Happiness score per country in 2017. Norway is in first position (7.537), followed closely by Denmark (7.521), Iceland (7.504), Switzerland (7.493) and Finland (7.468). They have the darkest colour on the map. Their averages are so close that small changes can re-order the rankings from year to year. Haiti in the Caribbean (3.602) and African countries such as Tanzania (3.348), Rwanda (3.470), Togo (3.494) and Benin (3.657) have the lowest levels of Happiness compared to other countries in the world. They have the lightest colour on the map.



Analysis

In this section, we will provide an analysis of our results and answer the questions of the first part.

Q1 : To which factors is Happiness correlated ?

According to the above correlation plot of the data, which was created with the corrplot package, we see that Happiness (that we measure by the Happiness score) seems to be correlated with most of our factors. However, we are interested in the factors that are highly correlated to Happiness. Therefore, we can notice that the Happiness score is highly and positively correlated with Life expectancy at birth, GDP per capita and Internet. On the other hand, the happiness score is negatively correlated with Gender Wage Gap and Renewable Energy consumption.
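As a small illustration of how such correlations are obtained, the sketch below computes a correlation matrix with base R's `cor` on eight rows copied from the summary table. With so few countries the coefficients differ from those of the full dataset, so only their signs should be read from it. The resulting matrix is the kind of object a correlation plot is drawn from.

```r
hs <- data.frame(
  Country = c("Norway", "Denmark", "Australia", "Brazil",
              "China", "India", "Kenya", "Togo"),
  Happiness.Score = c(7.54, 7.52, 7.28, 6.64, 5.27, 4.32, 4.55, 3.49),
  Life_exp        = c(82.61, 81.10, 82.50, 75.46, 76.47, 69.17, 65.91, 60.49),
  Wages           = c(-4.17, -5.82, -8.30, -10.25, 1.74, 1.03, 15.87, 23.74),
  Internet        = c(96.36, 97.10, 86.55, 67.47, 54.30, 18.20, 17.83, 12.36)
)

# Pearson correlation between every pair of numeric columns
cm <- cor(hs[, -1])

# First row: correlation of the Happiness score with each factor
round(cm["Happiness.Score", ], 2)
```

The off-diagonal entries between explanatory variables (for example Life_exp and Internet) are the ones to watch for multicollinearity before fitting a regression.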

There is a gap in average life expectancy at birth between countries, mostly related to the amount of money each country spends on healthcare. Countries that spend more on healthcare tend to have a higher average life expectancy and a lower infant mortality rate. The mortality rate among adults and young adults is also related to spending on security. Developed countries, such as Switzerland (ranked 4th in Happiness score), Hong Kong, Japan, Macao and Liechtenstein, have the largest healthcare spending, with better hospitals, doctors and nutrition than less developed countries, such as many African countries, some of which are in precarious situations.

GDP per capita has a strong relationship with rising Happiness. In countries with a high GDP per capita, which indicates good economic health, such as Switzerland, Norway, Iceland and Denmark (ranked 4th, 1st, 3rd and 2nd respectively), income inequality is lower and Happiness indices are higher : people can enjoy a better quality of life and a higher level of life satisfaction than in countries with a lower GDP per capita. (…). GDP per capita is also associated with life expectancy.

In some countries in Central and South America, such as Mexico, Brazil and Argentina, people are relatively happy even though income inequality is high. Some people believe that the things that make them happier tend to be things that money cannot buy. Income level therefore plays an important role, but it does not by itself explain people’s happiness, as GDP only measures a country’s total income.

Access to the Internet has been found to be related to happiness. It helps people connect and communicate with others around the world, work from home and study more productively. Access to networks and other IT tools makes daily life easier and increases well-being, so it has a positive impact on happiness. Norway (96.36%), Denmark (97.10%) and Iceland (98.26%), the top 3 countries in Happiness score, are among the countries with the highest proportion of the population covered by at least a 3G mobile network.

Q2 : Which socio-economic factors do country authorities need to focus on in order to increase people’s Happiness ?

To answer this question, we ran a regression analysis to assess the importance and significance of our factors. We used the function lm to fit the model and then the function summary to obtain its summary table. To present it in a nicer way, we used the function tab_model from the package sjPlot, which creates a summary table of our regression in HTML.



Dependent variable: Happiness.Score

Predictors      Estimates   CI              p
(Intercept)     -1.98       -4.81 – 0.84    0.166
GDP             0.00        0.00 – 0.00     0.008
Education       0.13        0.05 – 0.21     0.002
Forest_per      -0.00       -0.01 – 0.01    0.989
Life_exp        0.09        0.05 – 0.13     <0.001
EnergyE         -0.00       -0.01 – 0.01    0.827
Internet        0.01        -0.00 – 0.02    0.156
Wages           0.01        -0.01 – 0.03    0.309
Pollution       -0.01       -0.05 – 0.03    0.562

Observations: 97
R2 / R2 adjusted: 0.742 / 0.719


According to our regression model, most of the variables are not significant, so it would be better to remove them from the model and work only with the significant ones. The non-significant variables are Forest_per, EnergyE, Pollution, Internet and Wages, because each of them has a p-value higher than 5%.

We can now verify the previous statement with a model selection method. We chose stepwise model selection by AIC, using the function stepAIC from the MASS package.
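The selection step can be sketched as follows. The data are simulated (an assumption for illustration): Life_exp truly drives the response, while Noise1 and Noise2 are hypothetical irrelevant predictors that the procedure may drop. The argument direction = "both" is also an assumption, chosen because the trace in this section shows both "-" and "+" rows.

```r
library(MASS)  # provides stepAIC

# Simulated stand-in data with one real predictor and two noise predictors.
set.seed(2)
n <- 97
demo <- data.frame(
  Life_exp = rnorm(n, 72, 7),
  Noise1   = rnorm(n),
  Noise2   = rnorm(n)
)
demo$Happiness.Score <- -2 + 0.1 * demo$Life_exp + rnorm(n, sd = 0.5)

full_model <- lm(Happiness.Score ~ Life_exp + Noise1 + Noise2, data = demo)

# Stepwise search: at each step, drop (or re-add) the term that lowers AIC most.
# trace = FALSE hides the step-by-step table; set it to TRUE to see it.
selected <- stepAIC(full_model, direction = "both", trace = FALSE)

print(formula(selected))
print(c(full = AIC(full_model), selected = AIC(selected)))
```

The AIC values in the stepAIC trace differ from those returned by AIC() by a constant that depends only on the sample size, so both rank the candidate models in the same order.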


## Start:  AIC=-96.91
## Happiness.Score ~ GDP + Education + Forest_per + Life_exp + EnergyE + 
##     Internet + Wages + Pollution
## 
##              Df Sum of Sq    RSS     AIC
## - Forest_per  1    0.0001 29.669 -98.908
## - EnergyE     1    0.0162 29.685 -98.855
## - Pollution   1    0.1143 29.783 -98.535
## - Wages       1    0.3523 30.021 -97.763
## <none>                    29.669 -96.908
## - Internet    1    0.6894 30.358 -96.680
## - GDP         1    2.5158 32.184 -91.013
## - Education   1    3.4002 33.069 -88.383
## - Life_exp    1    6.0014 35.670 -81.039
## 
## Step:  AIC=-98.91
## Happiness.Score ~ GDP + Education + Life_exp + EnergyE + Internet + 
##     Wages + Pollution
## 
##              Df Sum of Sq    RSS      AIC
## - EnergyE     1    0.0178 29.687 -100.850
## - Pollution   1    0.1167 29.785 -100.527
## - Wages       1    0.3540 30.023  -99.757
## <none>                    29.669  -98.908
## - Internet    1    0.7034 30.372  -98.635
## + Forest_per  1    0.0001 29.669  -96.908
## - GDP         1    2.5560 32.225  -92.892
## - Education   1    3.4090 33.078  -90.357
## - Life_exp    1    6.1040 35.773  -82.760
## 
## Step:  AIC=-100.85
## Happiness.Score ~ GDP + Education + Life_exp + Internet + Wages + 
##     Pollution
## 
##              Df Sum of Sq    RSS      AIC
## - Pollution   1    0.0994 29.786 -102.525
## - Wages       1    0.3364 30.023 -101.757
## <none>                    29.687 -100.850
## - Internet    1    0.7140 30.401 -100.544
## + EnergyE     1    0.0178 29.669  -98.908
## + Forest_per  1    0.0016 29.685  -98.855
## - GDP         1    2.8830 32.570  -93.859
## - Education   1    3.4081 33.095  -92.308
## - Life_exp    1    6.8044 36.491  -82.832
## 
## Step:  AIC=-102.53
## Happiness.Score ~ GDP + Education + Life_exp + Internet + Wages
## 
##              Df Sum of Sq    RSS      AIC
## - Wages       1    0.3058 30.092 -103.535
## <none>                    29.786 -102.525
## - Internet    1    0.6258 30.412 -102.508
## + Pollution   1    0.0994 29.687 -100.850
## + Forest_per  1    0.0029 29.783 -100.535
## + EnergyE     1    0.0005 29.785 -100.527
## - GDP         1    2.7991 32.585  -95.813
## - Education   1    3.6635 33.449  -93.274
## - Life_exp    1    7.4794 37.265  -82.795
## 
## Step:  AIC=-103.53
## Happiness.Score ~ GDP + Education + Life_exp + Internet
## 
##              Df Sum of Sq    RSS      AIC
## - Internet    1    0.4066 30.498 -104.233
## <none>                    30.092 -103.535
## + Wages       1    0.3058 29.786 -102.525
## + Pollution   1    0.0688 30.023 -101.757
## + EnergyE     1    0.0150 30.077 -101.583
## + Forest_per  1    0.0134 30.078 -101.578
## - GDP         1    3.4956 33.587  -94.875
## - Education   1    3.5597 33.651  -94.690
## - Life_exp    1    7.3645 37.456  -84.299
## 
## Step:  AIC=-104.23
## Happiness.Score ~ GDP + Education + Life_exp
## 
##              Df Sum of Sq    RSS      AIC
## <none>                    30.498 -104.233
## + Internet    1    0.4066 30.092 -103.535
## + Wages       1    0.0865 30.412 -102.508
## + EnergyE     1    0.0115 30.487 -102.270
## + Forest_per  1    0.0066 30.492 -102.254
## + Pollution   1    0.0061 30.492 -102.252
## - Education   1    3.7458 34.244  -94.996
## - GDP         1    4.7492 35.247  -92.195
## - Life_exp    1   19.6307 50.129  -58.031


## 
##  Shapiro-Wilk normality test
## 
## data:  resid(Quation)
## W = 0.99104, p-value = 0.7651


Moreover, we computed a Shapiro-Wilk normality test on the residuals of the model to check whether they are consistent with a normal distribution, which is one of the assumptions behind our regression. The null hypothesis of the test is that the residuals are normally distributed. The p-value of the test (0.7651) is higher than 0.05, so we cannot reject the null hypothesis: the residuals show no evidence of departing from normality.
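A minimal sketch of this check on simulated data (the variables x and y are placeholders, not the report's data) looks like this:

```r
# Simulated regression with normally distributed errors.
set.seed(3)
x <- runif(97, 55, 85)
y <- -2 + 0.1 * x + rnorm(97, sd = 0.5)
fit <- lm(y ~ x)

# Shapiro-Wilk test applied to the model residuals.
sw <- shapiro.test(resid(fit))
print(sw)

# Decision at the 5% level: TRUE means normality is NOT rejected.
# This is an absence of evidence against normality, not a proof of it.
normality_not_rejected <- sw$p.value > 0.05
print(normality_not_rejected)
```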

The model selection criterion gives the same result as before: we should keep Life_exp, GDP, and Education.

Moreover, we previously said that some independent variables were correlated with each other and that we could therefore be facing a multicollinearity problem. To check this, we used the function vif to compute the variance inflation factor of our variables.
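To make the quantity concrete, the sketch below computes the variance inflation factor by hand in base R: each predictor is regressed on all the others, and the VIF is 1 / (1 - R²) of that auxiliary regression. The helper vif_manual and the simulated predictors A, B and C are hypothetical names introduced for illustration; this is not the vif function used in the report.

```r
# Manual VIF: regress each predictor on the others and use 1 / (1 - R^2).
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    aux <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

set.seed(4)
n <- 97
a <- rnorm(n)
demo <- data.frame(
  A = a,
  B = a + rnorm(n, sd = 0.3),  # deliberately collinear with A
  C = rnorm(n)                 # independent of A and B
)

vifs <- vif_manual(demo)
print(round(vifs, 2))
```

In this simulated example, A and B inflate each other's variance while C stays close to 1, which is the pattern the VIF is designed to reveal.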

##        GDP  Education Forest_per   Life_exp    EnergyE   Internet      Wages 
##   3.037006   1.106020   1.193553   4.930371   2.386822   6.065067   2.596966 
##  Pollution 
##   2.628694

A VIF higher than 5 is commonly considered a sign of severe multicollinearity, and removing the corresponding variable would benefit the model. In our case, Internet (6.07) shows strong multicollinearity, which supports its removal from our previous model. The other variables seem fine, although Life_exp (4.93) is close to the limit. Nevertheless, we decided as a group to follow the AIC model selection and keep a model with the most significant variables.


Dependent variable: Happiness.Score

Predictors      Estimates   CI              p
(Intercept)     -2.42       -4.20 – -0.63   0.008
GDP             0.00        0.00 – 0.00     <0.001
Education       0.13        0.05 – 0.21     0.001
Life_exp        0.10        0.07 – 0.12     <0.001

Observations: 97
R2 / R2 adjusted: 0.735 / 0.726


Governments should focus on improving these factors first, and there are many ways to do so. Let us first consider Life expectancy. Life expectancy has generally increased over the last decade across all countries, mainly due to improvements in health policies and in lifestyle as a whole. However, economically struggling countries tend to have trouble finding ways to raise Life expectancy, and this can be seen especially in Africa.

Next, improving GDP per capita may well improve the Happiness score. The economic well-being of a country is necessary for its population to experience Happiness, but some countries are not developing in the right direction or fast enough to achieve it. Increasing productivity or growing the workforce are possible answers to this issue.

Finally, Education is an important factor affecting Happiness in a country. Education determines the future economic state of a nation. Therefore, government spending on Education should increase in most countries.

Q3 : Why do the main factors not have the same impact on different continents ?

We continue from the previous selection process. To answer the question in this section, we estimate a new equation for each continent to see how these three factors carry different weights in the Happiness Score. This time, however, we do not select the significant variables again, because the resulting distortion would be considerable. We created a function called TO_C to make these equations easier to compute. In addition, we use the tab_model function again to show the linear regression results.
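The definition of TO_C is not shown in this report, so the following is only a sketch of one way to fit the same three-variable model separately for each continent. The helper fit_by_continent and the simulated data frame demo are hypothetical; only the column names and the group sizes for Europe, Asia and the Americas follow the report.

```r
# Simulated stand-in data with a continent label per country.
set.seed(5)
demo <- data.frame(
  Continent = rep(c("Europe", "Asia", "Americas"), times = c(33, 26, 19)),
  GDP       = runif(78, 1, 80),
  Education = runif(78, 2, 8),
  Life_exp  = runif(78, 60, 85)
)
demo$Happiness.Score <- 0.02 * demo$GDP + 0.15 * demo$Education +
  0.06 * demo$Life_exp + rnorm(78, sd = 0.5)

# Hypothetical helper (not the authors' TO_C): one lm per continent.
fit_by_continent <- function(data) {
  lapply(split(data, data$Continent), function(d) {
    lm(Happiness.Score ~ GDP + Education + Life_exp, data = d)
  })
}

models <- fit_by_continent(demo)

# Collect the coefficients: one row per continent, one column per term.
coef_table <- t(sapply(models, coef))
print(round(coef_table, 3))
```

Each element of models is an ordinary lm object, so summary or tab_model can be applied to it exactly as for the global model.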


Europe

Predictors      Estimates   CI              p
(Intercept)     0.04        -5.43 – 5.52    0.987
GDP             0.02        0.01 – 0.03     <0.001
Education       0.22        0.07 – 0.36     0.005
Life_exp        0.06        -0.02 – 0.13    0.123

Observations: 33
R2 / R2 adjusted: 0.766 / 0.742


First, the model for Europe contains a highly significant variable, GDP per capita, whose p-value is far below 0.05, so its influence is noticeable. Education is also significant (p = 0.005). The adjusted R-squared is high (0.742), so the model explains the observed data well. Life expectancy is the only variable that does not meet the 5% significance threshold.


Asia

Predictors      Estimates   CI              p
(Intercept)     1.19        -4.82 – 7.20    0.685
GDP             0.02        -0.00 – 0.04    0.063
Education       0.13        -0.03 – 0.28    0.099
Life_exp        0.05        -0.04 – 0.13    0.248

Observations: 26
R2 / R2 adjusted: 0.566 / 0.507


Asia, by contrast, has a moderate adjusted R-squared (0.507), which shows that the model explains the data with some uncertainty. In addition, none of its variables is significant at the 5% level.


Americas

Predictors      Estimates   CI              p
(Intercept)     -6.29       -11.58 – -1.00  0.023
GDP             0.01        -0.01 – 0.03    0.380
Education       0.10        -0.05 – 0.25    0.193
Life_exp        0.16        0.08 – 0.23     <0.001

Observations: 19
R2 / R2 adjusted: 0.747 / 0.696


In contrast to Asia, the Americas have a better-fitting regression, with an adjusted R-squared of almost 70%. However, the Americas have more non-significant variables than Europe. According to its p-value, Life expectancy is the only factor that clearly explains the Happiness Score for the Americas.


Africa

Predictors      Estimates   CI              p
(Intercept)     -0.04       -4.27 – 4.18    0.982
GDP             -0.01       -0.18 – 0.16    0.892
Education       0.08        -0.12 – 0.28    0.387
Life_exp        0.06        -0.01 – 0.13    0.075

Observations: 17
R2 / R2 adjusted: 0.341 / 0.189


Regarding Africa, the overall p-value of the regression is higher than 0.05 (0.13), which means that we cannot reject the null hypothesis that none of the variables explains the Happiness Score. In other words, the model as a whole is not significant. Moreover, its adjusted R-squared is very low (0.189). Africa is therefore the only continent for which we modify the regression, because we want to identify clearly and correctly the factors that influence this region.
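The overall p-value and the adjusted R-squared quoted for Africa can be recovered, up to rounding, from the reported R-squared (0.341), the number of observations (17) and the number of predictors (3). The short sketch below uses the standard formulas for the overall F-test and the adjusted R-squared of a linear regression; the variable names are illustrative.

```r
r2 <- 0.341  # R-squared reported for Africa
n  <- 17     # observations
k  <- 3      # predictors: GDP, Education, Life_exp

# Overall F statistic and its p-value (H0: all slope coefficients are zero).
f_stat  <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# Adjusted R-squared penalises the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(c(F = f_stat, p = p_value, adj_R2 = adj_r2), 3))
```

With only 17 observations and 3 predictors, the penalty in the adjusted R-squared is large, which is why it falls so far below the raw R-squared.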


## Start:  AIC=-13.18
## Happiness.Score ~ GDP + Education + Life_exp + EnergyE + Internet + 
##     Wages + Pollution
## 
##             Df Sum of Sq    RSS     AIC
## - EnergyE    1   0.04380 3.0987 -14.938
## - Pollution  1   0.04822 3.1031 -14.914
## - Life_exp   1   0.16567 3.2205 -14.282
## <none>                   3.0549 -13.180
## - Wages      1   0.53613 3.5910 -12.431
## - GDP        1   0.54583 3.6007 -12.386
## - Education  1   0.56017 3.6150 -12.318
## - Internet   1   0.96961 4.0245 -10.494
## 
## Step:  AIC=-14.94
## Happiness.Score ~ GDP + Education + Life_exp + Internet + Wages + 
##     Pollution
## 
##             Df Sum of Sq    RSS     AIC
## - Pollution  1   0.00726 3.1059 -16.898
## - Life_exp   1   0.12197 3.2206 -16.282
## <none>                   3.0987 -14.938
## - Education  1   0.54829 3.6470 -14.168
## - Wages      1   0.58721 3.6859 -13.988
## + EnergyE    1   0.04380 3.0549 -13.180
## - GDP        1   0.89479 3.9935 -12.626
## - Internet   1   1.40360 4.5023 -10.587
## 
## Step:  AIC=-16.9
## Happiness.Score ~ GDP + Education + Life_exp + Internet + Wages
## 
##             Df Sum of Sq    RSS     AIC
## - Life_exp   1   0.11630 3.2222 -18.273
## <none>                   3.1059 -16.898
## - Education  1   0.67025 3.7762 -15.576
## - Wages      1   0.68677 3.7927 -15.502
## + Pollution  1   0.00726 3.0987 -14.938
## + EnergyE    1   0.00284 3.1031 -14.914
## - GDP        1   1.11876 4.2247 -13.669
## - Internet   1   1.40464 4.5106 -12.555
## 
## Step:  AIC=-18.27
## Happiness.Score ~ GDP + Education + Internet + Wages
## 
##             Df Sum of Sq    RSS     AIC
## <none>                   3.2222 -18.273
## - Education  1   0.56199 3.7842 -17.540
## - Wages      1   0.58591 3.8081 -17.433
## + Life_exp   1   0.11630 3.1059 -16.898
## + Pollution  1   0.00159 3.2206 -16.282
## + EnergyE    1   0.00130 3.2209 -16.280
## - GDP        1   1.10292 4.3251 -15.269
## - Internet   1   2.08150 5.3037 -11.802
## 
## Call:
## lm(formula = Happiness.Score ~ GDP + Education + Internet + Wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87306 -0.24254 -0.05922  0.38237  0.73844 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.57015    0.55710   6.408 3.36e-05 ***
## GDP         -0.19587    0.09665  -2.027   0.0655 .  
## Education    0.11734    0.08111   1.447   0.1736    
## Internet     0.03429    0.01232   2.784   0.0165 *  
## Wages       -0.02874    0.01946  -1.477   0.1654    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5182 on 12 degrees of freedom
## Multiple R-squared:  0.5458, Adjusted R-squared:  0.3944 
## F-statistic: 3.605 on 4 and 12 DF,  p-value: 0.03754


Africa

Predictors      Estimates   CI              p
(Intercept)     3.57        2.36 – 4.78     <0.001
GDP             -0.20       -0.41 – 0.01    0.066
Education       0.12        -0.06 – 0.29    0.174
Internet        0.03        0.01 – 0.06     0.017
Wages           -0.03       -0.07 – 0.01    0.165

Observations: 17
R2 / R2 adjusted: 0.546 / 0.394


After applying the AIC selection to Africa alone, we decided to discard this continent from the analysis because its new equation has a negative GDP coefficient, which is counterintuitive. Even though the overall p-value is below 0.05 (0.038), the factors do not explain the Happiness Score well (adjusted R-squared of 39%).


Moreover, to see the differences in the impact of each factor, we created a bar plot that shows the size of the coefficients for the three remaining continents. Note that we used pivot_longer to make the plotting easier.
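As an illustration of the reshaping step, the sketch below takes the coefficients reported in the Europe, Asia and Americas tables and turns them into long format with base R. This is a base-R analogue of the pivot_longer step, not the authors' code. It also computes the gap between the largest and smallest coefficient of each variable.

```r
# Coefficients as reported in the three continent tables above.
coefs <- matrix(
  c(0.02, 0.22, 0.06,   # Europe
    0.02, 0.13, 0.05,   # Asia
    0.01, 0.10, 0.16),  # Americas
  nrow = 3, byrow = TRUE,
  dimnames = list(Continent = c("Europe", "Asia", "Americas"),
                  Variable  = c("GDP", "Education", "Life_exp"))
)

# Long format: one row per continent-variable pair.
long <- as.data.frame(as.table(coefs), responseName = "Estimate")
print(long)

# Spread of each coefficient across continents (max minus min).
gaps <- apply(coefs, 2, function(x) diff(range(x)))
print(gaps)
```

A long table of this shape can be passed directly to ggplot with Variable on the x axis, Estimate on the y axis and Continent as the fill colour to obtain a grouped bar plot.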



As we can see, the “Influence of significant variables in HS” graph created with ggplot shows the three main variables for the three remaining continents. GDP per capita is the variable with the lowest disparity, since the gap between the regional coefficients is no more than 0.025. For Education, the impact in Europe is almost double its influence in the other two continents. Life expectancy, in turn, has a coefficient more than twice as large in the Americas as elsewhere. Whether the gaps are wide or narrow, the coefficients show at least a moderate dispersion. According to these results, people seem to give extra importance to certain variables depending on where they live. Finally, the coefficients suggest that the average person, whatever the place, gives more weight to Education when thinking about Happiness.




We then produced several graphs to show the relationship of each factor with the Happiness Score. Their main purpose is to show that factors that look atypical in the bar plot can still be well connected to the explained variable. In the four graphs, and especially for Life expectancy in the Americas, there is a part where the confidence interval is not narrow. What prevails, however, is that the shaded region never becomes very wide, which means that the standard error is not high.



Conclusion

Final thoughts

Despite their different cultures, all countries have in common that every citizen wants to be happy. Various factors affect Happiness in each country, but the relevance or weight of each variable differs from one continent to another. We have attempted to explore the factors that positively impact Happiness and thereby influence the score and ranking of the countries of the world. Through a regression analysis, we found that the most critical factors for increasing Happiness are Life expectancy at birth, GDP per capita, and investment in Education, although the factors that seem most correlated with the Happiness score are Life expectancy, GDP per capita, and Access to the Internet.

Africa is the continent where most countries have the lowest Happiness scores. African countries lag behind other countries in their evaluations of life, well-being, and Happiness. These lower levels of Happiness might be attributed to struggles with poverty, relatively unstable economies, inequality, exclusion, and an educational crisis, set against a long, turbulent history that created great cultural diversity. These issues can explain the lowest rates of Life expectancy, GDP per capita, and Education, which result in the last position in the Happiness score ranking. In the end, the strong distortion in its observations made this continent difficult to work with.

Education is one of the principal factors that governments should focus on. Investing in education means shaping our future. After all, GDP per capita, access to the Internet, and more derive from education, because education is about training the people who will take care of the world and develop new technologies that can increase the marginal productivity of each person. In the end, GDP per capita and Education appear to form a feedback loop.

According to the regression estimates, GDP per capita contributes to measuring Happiness, and it explains much of the variance in Happiness for countries at the top of the ranking. Rich countries can improve life satisfaction and economic well-being with better education, better health care, and healthier lifestyles, which increase life expectancy. In the poorest regions of the world, by contrast, people are the least satisfied with their lives and more frustrated with the challenges of making progress. The lack of resources devoted to public health and safety in poor countries can explain their lower life expectancy.


Limitations

In this section, we share the limitations that we encountered during our project. To start with, the geographical scope of our project was a significant challenge. We decided to work at the global level, and monitoring all the countries proved more difficult than we had expected. We also first had the idea of presenting the Happiness results and analysis from 2015 to 2019. However, we soon realized the amount of work and the level of skill that this would demand, so we decided to reduce the time frame to 2017-2018. Finally, for most of our factors’ datasets, 2018 had far fewer observations than other years, so we decided to concentrate on the year 2017.


Future Work

If we returned to the same topic, we would like to work with more datasets covering different factors to determine whether any other interesting variable affects the Happiness score. Working with more information could also help fill in the missing values that we ended up deleting.

It would also be interesting to add more years, both to see how the Happiness score has evolved and to obtain a more precise regression: according to basic statistics, when the sample size increases, the error decreases (the standard error shrinks and the confidence interval becomes narrower).

Finally, we think it would be interesting to find datasets that allow us to create a new variable indicating whether Covid-19 was present during a given period. With it, we could observe new behaviors caused by the change in people’s perceptions after experiencing a worldwide event.